Airbnb Paris dataset analysis

Datacamp 2021

Alexandre PERBET
Cyril NERIN
Hugo RIALAN
Paul ORLUC
Walid CHRIMNI
Zakaria BEKKR



INTRODUCTION

Airbnb is a community marketplace that connects travelers with hosts: hotel operators, rental property investors, and individuals renting out all or part of their own home. The site provides a search and booking platform between hosts offering accommodation and renters. It covers more than 1.5 million listings in over 34,000 cities and 191 countries. In our study, we restrict ourselves to the city of Paris.

We will use Machine Learning algorithms to predict the price of an Airbnb rental in Paris.

The use case of our work would be:

Data source: http://insideairbnb.com/get-the-data.html

Libraries

Utility functions

Load Data
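A self-contained sketch of the loading step. In the notebook, the CSV files downloaded from Inside Airbnb are read with `pd.read_csv`; here a tiny inline CSV stands in for them so the example runs on its own (column names follow the Inside Airbnb schema).

```python
import io

import pandas as pd

# In the notebook: df_main = pd.read_csv("listings.csv"), with the file taken
# from http://insideairbnb.com/get-the-data.html. A tiny inline CSV stands in.
csv_data = io.StringIO(
    "id,name,neighbourhood,room_type,price\n"
    "1,Cosy flat near Louvre,Louvre,Entire home/apt,120\n"
    "2,Bright room in Montmartre,Buttes-Montmartre,Private room,55\n"
)
df_main = pd.read_csv(csv_data)

# Quick sanity checks on shape and column types after loading.
print(df_main.shape)
print(df_main.dtypes)
```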

Exploratory data analysis

The goal of this part is to compute exploratory data analysis on the dataset to have a good overview of it and gain information.

As some parts require accurate information, we also perform some light preprocessing to make the exploratory analysis as reliable as possible.

Data dictionary

The data from df_main

We notice that the attribute "neighbourhood_group" is not filled in.

Location of the apartments for rent

Textual variables

column "name"

We can see that the descriptions of the apartments listed in our dataset are skewed towards upper-class Paris areas. We can also see frequent use of appealing quality adjectives such as: cosy, bright, charming, etc.

Column "neighbourhood"

The data from df_listings

Construction of the dataset that will be used for the analysis

Cleaning the data set

Dealing with Missing Values

Columns deletion

Neighbourhood_group

We note that neighbourhood_group has only one unique value, which is NaN. Since this column is not filled in at all, we delete it.

License

Moreover, license has more than half of its content missing. As this column carries little meaning, we chose to delete it.

last_review

In addition, no relevant default date can be specified for last_review. We chose to delete it.
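The three deletions above can be sketched as follows (column names as discussed; the DataFrame here is a toy stand-in for the notebook's `df`):

```python
import pandas as pd

# Toy frame with the three problem columns discussed above.
df = pd.DataFrame({
    "price": [120, 55],
    "neighbourhood_group": [float("nan"), float("nan")],  # entirely empty
    "license": [float("nan"), "75-0001"],                 # mostly missing, little meaning
    "last_review": [float("nan"), "2021-06-01"],          # no sensible default date
})

# Drop the columns that cannot be used for modelling.
df = df.drop(columns=["neighbourhood_group", "license", "last_review"])
print(df.columns.tolist())
```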

Replacement of missing data with default values

As we can see in the following output, in our new, cleaned data set, we no longer have any missing data:
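A sketch of the replacement step, assuming simple defaults (0 for numeric columns, an empty string for text); the defaults actually chosen in the notebook may differ:

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one text column containing missing values.
df = pd.DataFrame({
    "reviews_per_month": [1.2, np.nan],
    "host_name": ["Anna", np.nan],
})

# Hypothetical defaults: 0 for numeric data, "" for text data.
df["reviews_per_month"] = df["reviews_per_month"].fillna(0)
df["host_name"] = df["host_name"].fillna("")

# After filling, no missing data remains.
print(df.isna().sum().sum())
```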

Outliers processing

An outlier of a dataset is defined as a value that is more than 3 standard deviations from the mean.
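This rule can be sketched as a single filtering pass over a toy price series:

```python
import numpy as np
import pandas as pd

# 1000 plausible prices plus one extreme value acting as an outlier.
rng = np.random.default_rng(0)
prices = pd.Series(np.concatenate([rng.normal(100, 20, 1000), [5000.0]]))

# Keep values within 3 standard deviations of the mean (one pass, as in the text).
mean, std = prices.mean(), prices.std()
mask = (prices - mean).abs() <= 3 * std
filtered = prices[mask]
print(len(prices) - len(filtered), "outliers removed")
```

Note that, as the NOTE below points out, the outlier itself inflates the mean and standard deviation used by the rule, so a single pass may leave milder outliers in place.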

Two features are not displayed in the table above: "accommodates" and "bedrooms"

The variable 'price' (target variable)

WARNING: We observe abnormally high prices for the categories "Entire home/apt" and "Private room". These outliers on the target variable would prevent us from training a good Machine Learning model, so we need to remove them.

NOTE: Outliers inflate the mean and standard deviation of the values. Since our outlier selection algorithm uses these two quantities and we perform only one selection pass, some outliers will remain after the treatment, but the maximum price obtained will no longer penalize subsequent processing.

NOTE: The average price and the spread of prices around this average differ greatly according to the variable room_type.

Univariate Analysis

Profiling the dataset df

Target variable description

As expected, the most expensive listings are those where the entire home is available. We don't have much data on the last two categories (shared room and hotel room), so we can't say much about them.

This plot shows the prices by neighborhood. Location has a significant impact on the price: for example, a room in Gobelins is cheaper than a room in Luxembourg.
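The comparison can be reproduced with a groupby on the neighbourhood column (toy data shown; the real dataset has 20 arrondissement-based neighbourhoods):

```python
import pandas as pd

# Toy listings restricted to the two neighbourhoods compared in the text.
df = pd.DataFrame({
    "neighbourhood": ["Gobelins", "Gobelins", "Luxembourg", "Luxembourg"],
    "price": [70, 90, 150, 170],
})

# Average price per neighbourhood, cheapest first.
mean_price = df.groupby("neighbourhood")["price"].mean().sort_values()
print(mean_price)
```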

Correlation of 'price' with other variables

The coefficient of determination (squared correlation) is calculated so as to ignore the sign of the correlation.

We note that the variable 'price' is mainly correlated to the variables 'accommodates', 'bedrooms', 'beds' and 'availability_*'. We will therefore reduce the selected variables to improve the readability of the matrix of coefficients of determination.
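A sketch of the matrix of coefficients of determination, obtained by squaring the Pearson correlation matrix (toy data; column names follow the Inside Airbnb schema):

```python
import numpy as np
import pandas as pd

# Toy data where price is strongly driven by accommodates, as observed above.
rng = np.random.default_rng(1)
n = 200
accommodates = rng.integers(1, 7, n)
df = pd.DataFrame({
    "accommodates": accommodates,
    "price": 30 * accommodates + rng.normal(0, 10, n),
    "minimum_nights": rng.integers(1, 30, n),
})

# Squaring the Pearson correlations discards the sign, as described above.
r2_matrix = df.corr() ** 2
print(r2_matrix["price"].sort_values(ascending=False))
```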

Feature Engineering

Preprocessing

Train and test data

We keep the proportions of each room type in the train and test datasets.
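This stratified split can be sketched with scikit-learn's `train_test_split` (toy data; the notebook's split ratio may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy listings with a 2:1 ratio between the two room types.
df = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 8 + ["Private room"] * 4,
    "price": range(12),
})

# Stratify on room_type so both splits keep the same class proportions.
train, test = train_test_split(
    df, test_size=0.25, stratify=df["room_type"], random_state=42
)
print(train["room_type"].value_counts(normalize=True))
print(test["room_type"].value_counts(normalize=True))
```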

Evaluation of regression models

Selection of metrics

To quantify the performance of a regression model, different evaluation metrics are possible. We have chosen to use three classical metrics.

The coefficient of determination ($R^2$): in regression, $R^2$ is a statistical measure of how well the regression predictions approximate the real data points. It is defined from the fraction of unexplained variance as follows: $$ R^2(y, \widehat y) = 1 - \frac{\sum_{i=1}^N (y_i - \widehat y_i)^2 }{\sum_{i=1}^N (y_i - \bar y)^2} \enspace , $$ where $\bar y$ is the mean value of $y$. For a simple linear regression with an intercept, $R^2$ equals the square of the Pearson correlation between the true values $y$ and the predicted values $\widehat y$.

An $R^2$ of 1 indicates that the regression predictions perfectly fit the data. A high $R^2$ is necessary for accurate predictions, but a high $R^2$ alone does not guarantee a good model.

The Mean Absolute Error (MAE): this popular regression metric uses the $\ell_1$-norm to quantify the difference between the predictions and the true targets, i.e. $$ MAE(y, \widehat y) = \frac{1}{N}\sum_{i=1}^N |y_i - \widehat y_i| \enspace . $$ The MAE is easy to interpret because it is expressed in the same units as the original data. A perfect model produces an MAE of zero, and the closer the observed MAE is to zero, the better the model fits the data.

Depending on the application of the regression problem, it might be necessary to use a metric that quantifies the relative error instead of the absolute error. There exist various ways to define such a metric, in particular:

The Mean Absolute Percentage Error (MAPE): this metric computes the error between the predicted value and the target, rescaled by the value of the target $y$, i.e. $$ MAPE(y, \widehat y) = \frac{1}{N}\sum_{i=1}^N \left| \frac{y_i-\widehat y_i}{y_i}\right|~. $$
This means that the same error of 1 will not have the same effect depending on whether $y=1$ or $y=1000$, which is useful in our case, where apartment rental prices span a wide range even after the biggest outliers have been removed.

NOTE: We will compute the 3 metrics $R^2$, MAE and MAPE for each model tested on the Airbnb data, but we will select the best model using the MAPE metric, which we feel is most appropriate for evaluating the prediction of apartment rental prices.
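As a sketch, the three metrics can be computed with scikit-learn on a toy prediction vector:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)

# Toy true prices and model predictions.
y_true = np.array([100.0, 50.0, 80.0, 120.0])
y_pred = np.array([110.0, 45.0, 80.0, 100.0])

print("R^2 :", r2_score(y_true, y_pred))
print("MAE :", mean_absolute_error(y_true, y_pred))   # in price units (euros)
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))  # relative error
```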

Selection of best models

Cross validation
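A minimal sketch of the cross-validation step, with `Ridge` and synthetic data standing in for the models and dataset actually compared in the notebook:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data stands in for the Airbnb features and prices.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# 5-fold cross-validation: each fold is held out once for scoring.
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="r2")
print(scores.mean(), "+/-", scores.std())
```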

Selecting best features

Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in.

Having irrelevant features in your data can decrease the accuracy of the models and make them learn from features that carry no signal.

$\textbf{Benefits of performing feature selection}$

$\textbf{Method}$: Feature importance is computed with the permutation importance method for the top-five models. We then average the feature importances over these models and remove the features with the lowest importance.
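A sketch of the permutation importance computation with scikit-learn (a random forest on synthetic data stands in for the top-five models on the Airbnb features):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first 2 of 4 features carry signal (shuffle=False
# keeps the informative features in the first columns).
X, y = make_regression(
    n_samples=300, n_features=4, n_informative=2,
    noise=5, shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the drop in score.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
print(result.importances_mean)
```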

Feature importance

NOTE: We want to eliminate features that contribute almost nothing to the price prediction on the test dataset. This is why we use a very low threshold (0.01), which eliminates 4 features.

Removal of the least important features

Re-evaluation of regression models on the new data set

NOTE: The removal of 4 features "beds", "review_scores_checkin", "number_of_reviews_l30d", "review_scores_communication" has no impact on the performance of the best regression models.

Model optimisation

Searching the best hyper-parameters for the top-4 models

HistGradientBoosting

LGBM

GradientBoosting

xgboost

The final model

Visualize prediction

Explainability

Interpretability is defined as the ability of a human to understand the reasons for a model's decision. This criterion has become essential for many reasons.

Importance of variables extracted directly from the Model

Explanation of each rental price prediction with SHAP values

The goal of SHAP (SHapley Additive exPlanations) is to explain the prediction of a data instance x by computing the contribution of each feature to that prediction.

The SHAP explanation method computes Shapley values from game theory. The feature values of a data instance act as players in a coalition. The Shapley values tell us how to fairly distribute the "payoff" (= prediction) among the features. A player can be an individual feature or a set of features.
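To make this concrete, here is a minimal pure-Python Shapley computation on a toy two-feature price model (feature names and coefficients are hypothetical; the notebook itself uses the shap library on the fitted model). Features outside the coalition are replaced by their dataset averages, and each feature's value is its weighted average marginal contribution over all coalitions:

```python
from itertools import combinations
from math import factorial

def value(coalition, x, baseline):
    # Model output when features outside the coalition take their average value.
    inputs = {f: (x[f] if f in coalition else baseline[f]) for f in x}
    return 40 * inputs["accommodates"] + 25 * inputs["bedrooms"]  # toy price model

def shapley(feature, x, baseline):
    # Weighted average of the feature's marginal contribution over all coalitions.
    others = [f for f in x if f != feature]
    n = len(x)
    total = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            s = set(subset)
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (value(s | {feature}, x, baseline) - value(s, x, baseline))
    return total

x = {"accommodates": 4, "bedrooms": 2}          # instance to explain
baseline = {"accommodates": 3, "bedrooms": 1}   # dataset averages (toy values)
phi = {f: shapley(f, x, baseline) for f in x}
print(phi)

# Efficiency property: contributions sum to prediction minus baseline prediction.
print(sum(phi.values()), value(set(x), x, baseline) - value(set(), x, baseline))
```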

Warning: SHAP consumes a lot of CPU time. It should therefore be used on a reduced data set. We will analyze an example of the rental price of an apartment of each type predicted on the test data set.

Selection of rooms with a median price by type

Plot the SHAP values for the selected rooms

Conclusion

Summary of the results obtained

Convert notebook to HTML